R is also capabale of reading a variety of other dataset formats including: .dta, .sas, .xlsx, xls, txt, etc.
The type of library used to read these files will have implications for quickly your computer can read the data
Let us look at some examples
read_csv
read.csv
Importing Data into R
read_csv
library("readr")#This is to set the directorysetwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3/")# Reading datadata_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")#Recording how long it takes#Step1:Recoding your system's timestart_time <-Sys.time()#Step2:Loding the datadata_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")#Step3: Recording when it finishesend_time <-Sys.time()#Step4: Calculating the differencetime_taken_a <- end_time - start_time#Step5: Priting the differencetime_taken_a
Time difference of 0.1764569 secs
Importing Data into R
read.csv
# Reading data#Step1:Recoding your system's timestart_time <-Sys.time()#Step2:Loding the datadata_1851_obs10000_b<-read.csv("./data/data_1851_obs10000.csv")#Step3: Recording when it finishesend_time <-Sys.time()#Step4: Calculating the differencetime_taken_b <- end_time - start_time#Step5: Priting the differencetime_taken_b
Time difference of 0.2362909 secs
Importing Data into R
read.csv
read_csv
read.csv or read_csv makes a difference
the difference is: time_taken_b - time_taken_a = 0.0598
This is hugely important for large datasets
Alernative File Formats
csv
xlsx and xls
spss
Stata
Examples of Reading Different Files
#Reading CSV filescsv_file<-read.csv("./data/data_1851_obs10000.csv")#Reading a Stata file#Library for reading Stata fileslibrary("haven")#Library for reading SPSS fileslibrary("foreign")stata_file<-read_dta("./data/data_1851_obs10000.dta")spss_file<-read.spss("./data/data_1851_obs10000.sav", to.data.frame=TRUE)#Library for reading Excel filelibrary("readxl")excel_file<-read_excel("./data/data_1851_obs10000.xlsx")
Tidy Data
Tidy Data = “standard way of mapping the meaning of a dataset to its structure.” (Hadley Wickham)
Within tidy data:
each variable forms a column
each observation forms a row
each cell is a single measurement
Tidy data means that all datasets are alike
Tidy Data
Tidy Data
Tidy Data
Tidy Data
Tidy datasets are all alike
Tidy Data
Tidy datasets are all alike
Tidy Data
Tidy datasets are all alike
Untidy Data
Messy data can be messy in their own way.
Untidy Data
Messy data can be messy in their own way.
Untidy Data
Messy data can be messy in their own way.
Untidy Data
Messy data can be messy in their own way.
Untidy Data
Messy data can be messy in their own way.
Tidy Data
Tidy Data
You can become friends with tidy data by following one of the following strategies:
Pivoting
Longer
Wider
Separating
Uniting
Untidy Data 1: Pivoting Longer
A common problem is that column names are not names of variables, but values of a variable
#table4a#> # A tibble: 3 × 3#> country `1999` `2000`#> * <chr> <int> <int>#> 1 Afghanistan 745 2666#> 2 Brazil 37737 80488#> 3 China 212258 213766
We have columns whose names are values, not variables: 1999 and 2000
We need a new column that reflects these two variable. Let’s call it year
We also need another column that reflects the numbers. Let’s call it cases
Untidy Data 1: Pivoting Longer
Untidy Data 1: Pivoting Longer
Untidy Data 1: Pivoting Longer
This is what that looks like in code
Original
#table4a#> # A tibble: 3 × 3#> country `1999` `2000`#> * <chr> <int> <int>#> 1 Afghanistan 745 2666#> 2 Brazil 37737 80488#> 3 China 212258 213766
Fix
#table4a %>% # pivot_longer(c(`1999`, `2000`), # names_to = "year",# values_to = "cases")#> # A tibble: 6 × 3#> country year cases#> <chr> <chr> <int>#> 1 Afghanistan 1999 745#> 2 Afghanistan 2000 2666#> 3 Brazil 1999 37737#> 4 Brazil 2000 80488#> 5 China 1999 212258#> 6 China 2000 213766
Untidy Data 2: Pivoting Wider
This where one observation is scattered across multiple rows.
Within the table below, an observation is a country in a year, but each observation is spread across two rows.
#table4a#> # A tibble: 12 × 4#> country year type count#> <chr> <int> <chr> <int>#> 1 Afghanistan 1999 cases 745#> 2 Afghanistan 1999 population 19987071#> 3 Afghanistan 2000 cases 2666#> 4 Afghanistan 2000 population 20595360#> 5 Brazil 1999 cases 37737#> 6 Brazil 1999 population 172006362#> # … with 6 more rows
Untidy Data 2: Pivoting Wider
Untidy Data 2: Pivoting Wider
Untidy Data 2: Pivoting Wider
This is what that looks like in code
Original
#table4a#> # A tibble: 12 × 4#> country year type count#> <chr> <int> <chr> <int>#> 1 Afghanistan 1999 cases 745#> 2 Afghanistan 1999 population 19987071#> 3 Afghanistan 2000 cases 2666#> 4 Afghanistan 2000 population 20595360#> 5 Brazil 1999 cases 37737#> 6 Brazil 1999 population 172006362#> # … with 6 more rows
Fix
#table4a %>% #> pivot_wider(names_from = type, #> values_from = count)#> # A tibble: 6 × 4#> country year cases population#> <chr> <int> <int> <int>#> 1 Afghanistan 1999 745 19987071#> 2 Afghanistan 2000 2666 20595360#> 3 Brazil 1999 37737 172006362#> 4 Brazil 2000 80488 174504898#> 5 China 1999 212258 1272915272#> 6 China 2000 213766 1280428583
Untidy Data 3: Separating
We sometimes may have columns that contain data which should separated in multiple columns
This where we use separate()
#table3#> # A tibble: 6 × 3#> country year rate #> * <chr> <int> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272#> 6 China 2000 213766/1280428583
Untidy Data 3: Separating
Untidy Data 3: Separating
Untidy Data 3: Separating
This is what that looks like in code
Original
#> # A tibble: 6 × 3#> country year rate #> * <chr> <int> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272#> 6 China 2000 213766/1280428583
Fix
#table3 %>% # separate(rate, into = c("cases", "population"))#> # A tibble: 6 × 4#> country year cases population#> <chr> <int> <chr> <chr> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272#> 6 China 2000 213766 1280428583
Untidy Data 4: Uniting
Uniting is the inverse of separating
It combines multiple columns into a single column
#table5 %>% #> # A tibble: 6 × 4#> country century year rate #> <chr> <chr> <chr> <chr> #> 1 Afghanistan 19 99 745/19987071 #> 2 Afghanistan 20 00 2666/20595360 #> 3 Brazil 19 99 37737/172006362 #> 4 Brazil 20 00 80488/174504898 #> 5 China 19 99 212258/1272915272#> 6 China 20 00 213766/1280428583
Untidy Data 4: Uniting
Untidy Data 4: Uniting
Untidy Data 4: Uniting
This is what that looks like in code
Original
#table5#> # A tibble: 6 × 4#> country century year rate #> <chr> <chr> <chr> <chr> #> 1 Afghanistan 19 99 745/19987071 #> 2 Afghanistan 20 00 2666/20595360 #> 3 Brazil 19 99 37737/172006362 #> 4 Brazil 20 00 80488/174504898 #> 5 China 19 99 212258/1272915272#> 6 China 20 00 213766/1280428583
Fix
#table5 %>% # unite(new, century, year, sep = "")#> # A tibble: 6 × 3#> country new rate #> <chr> <chr> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272#> 6 China 2000 213766/1280428583
In the real world, people typically have to work with many tables
Multiple tables of data that need to be combined together, are called relational data
Verbs for Relational Data
Mutating joins
add new variables to one data frame from matching observations in another
Filtering joins
filter observations from one data frame based on whether or not they match an observation in the other table
Set operations
treat observations as if they were set elements
Examples: NYC Flights
We will first load two libraries
library(tidyverse)library(nycflights13)
There are five databases (tibbles) inside nycflights13:
airlines
airports
planes
weather
flights
The airlines tibble
airlines
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
The airports tibble
airports
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows